Performance of artificial intelligence in predicting the prognosis of severe COVID-19: a systematic review and meta-analysis

Background COVID-19-induced pneumonia has become a persistent health concern, with severe cases posing a significant threat to patient lives. However, the potential of artificial intelligence (AI) in assisting physicians in predicting the prognosis of severe COVID-19 patients remains unclear. Methods To obtain relevant studies, two researchers conducted a comprehensive search of the PubMed, Web of Science, and Embase databases, including all studies published up to October 31, 2023, that utilized AI to predict mortality rates in severe COVID-19 patients. The PROBAST 2019 tool was employed to assess the potential bias in the included studies, and Stata 16 was used for meta-analysis, publication bias assessment, and sensitivity analysis. Results A total of 19 studies, comprising 26 models, were included in the analysis. Among them, the models that incorporated both clinical and radiological data demonstrated the highest performance. These models achieved an overall sensitivity of 0.81 (0.64–0.91), specificity of 0.77 (0.71–0.82), and an overall area under the curve (AUC) of 0.88 (0.85–0.90). Subgroup analysis revealed notable findings. Studies conducted in developed countries exhibited significantly higher predictive specificity for both radiological and combined models (p < 0.05). Additionally, investigations involving non-intensive care unit patients demonstrated significantly greater predictive specificity (p < 0.001). Conclusion The current evidence suggests that artificial intelligence prediction models show promising performance in predicting the prognosis of severe COVID-19 patients. However, due to variations in the suitability of different models for specific populations, it is not yet certain whether they can be fully applied in clinical practice. There is still room for improvement in their predictive capabilities, and future research and development efforts are needed. 
Systematic review registration https://www.crd.york.ac.uk/prospero/ with the Unique Identifier CRD42023431537.


Introduction
The novel coronavirus disease 2019 (COVID-19) pandemic, caused by severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), has had a profound effect worldwide (1). As of December 2023, the World Health Organization has reported 773,119,173 COVID-19 cases (2). The medical community has conducted extensive research on the infection and pathogenesis mechanisms of COVID-19 (3). A series of measures have been implemented to control its spread and infection (4), such as the development of vaccines and the implementation of policies (5, 6).
While most individuals infected with SARS-CoV-2 experience either no or mild respiratory symptoms, a small percentage develop severe COVID-19 pneumonia or acute respiratory distress syndrome (ARDS). These cases can be life-threatening and necessitate intensive care or tracheal intubation (7-9). Studies have indicated that patients who undergo invasive mechanical ventilation are vulnerable to secondary infections, which can further increase the mortality rate (10). Additionally, one study has indicated that the infection fatality rate (IFR), the anticipated ratio between deaths and infections, increases exponentially with age among COVID-19 patients (11). Furthermore, as the post-pandemic era unfolds, numerous previously infected individuals are expected to experience complications and sequelae, some of which may be severe and fatal (12). Consequently, accurate prognosis prediction is crucial for effectively managing these cases.
In recent years, there have been significant advancements in technology, leading to the continuous evolution of computer-aided techniques. These techniques have gradually developed into a diverse set of diagnostic and prognostic systems, with a particular focus on the field of medical imaging. These systems are designed to perform various tasks, including classification, regression, segmentation, and tracking (13). Artificial intelligence (AI) in medicine is evolving, reshaping medicine, and improving the experience of clinicians and patients (14-16). Research has demonstrated that artificial intelligence and machine learning can outperform clinical doctors in disease prediction on certain occasions (17).
In the context of COVID-19, a considerable number of studies have been conducted, employing artificial intelligence models for the purposes of diagnosis, treatment, and prediction (18-20). In intensive care respiratory medicine, artificial intelligence has made initial achievements in the prognostic prediction of diseases and has gradually become an auxiliary diagnostic tool for clinicians (21, 22). Recently, a study successfully developed a random forest model to predict hypotensive events in the Intensive Care Unit (ICU). The model exhibited an impressive sensitivity of 92.7%, enabling the prediction of these events up to 15 min in advance (23).
Multiple studies have documented the utilization and advancement of AI in prognostic prediction for critically ill COVID-19 patients. However, there is a dearth of comprehensive evaluations regarding its effectiveness. Consequently, the true potential of current AI technology in clinical practice remains uncertain. To address this gap, our study aims to investigate how well AI methods provide reproducible prognostic predictions.

Protocol and registration
This study was performed in accordance with the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) statement and involved a secondary analysis based on published research (24). This study was registered in the International Prospective Register of Systematic Reviews (PROSPERO) database (Registration Number CRD42023431537). Ethics approval was not obtained for this study.

Database and search strategy
We searched PubMed, Web of Science, and Embase for all studies published before October 31, 2023, on the use of artificial intelligence techniques to predict the death of patients with severe COVID-19. Our search strategy included a combination of controlled vocabulary terms (NCBI's MeSH terms) and free keywords. The keywords used encompassed terms such as "Critical Care," "Artificial Intelligence," and "COVID-19." An example of the search strategy employed on PubMed can be found in Table 1.

Exclusion criteria
(1) Studies with missing outcome indicators, unavailable data, or inconvertible data were excluded, as were (2) duplicated reports and (3) review reports. (4) The original articles were not included.

Screening of literature
After literature retrieval, duplicate records were removed using EndNote. Two researchers read the titles and abstracts, screened them according to the inclusion and exclusion criteria, and obtained the full texts of the remaining literature. If the original text could not be obtained from the Internet, the author of the original text was contacted, and the full text was read and further screened.

Data extraction
Two researchers independently extracted the following literature data: author, publication year, study type, prediction model category, country, income level [high-income and non-high-income economies as defined by the World Bank (25)], total number of patients, actual number of dead patients, actual number of surviving patients, predicted number of dead patients, and predicted number of surviving patients (only the optimal model data of the external validation set were extracted; if not explicitly specified, the dataset provided in the article was assumed to be the validation set). After both researchers completed data extraction, the results were cross-checked, and discrepancies were discussed and finalized.
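As an illustration of how such extracted counts relate to a 2×2 contingency table, the following minimal Python sketch rebuilds the four cells from the numbers of deceased and surviving patients together with a reported sensitivity and specificity. The function name, the rounding rule, and the example numbers are assumptions made for illustration only; they are not taken from the included studies or from the authors' extraction procedure.

```python
from dataclasses import dataclass


@dataclass
class TwoByTwo:
    tp: int  # predicted dead, actually dead
    fn: int  # predicted survivor, actually dead
    fp: int  # predicted dead, actually survived
    tn: int  # predicted survivor, actually survived


def reconstruct_two_by_two(n_dead, n_survived, sensitivity, specificity):
    """Rebuild a 2x2 table from class sizes and reported accuracy measures.

    Cells are rounded to whole patients, so the result is approximate
    whenever the reported measures were themselves rounded.
    """
    tp = round(sensitivity * n_dead)
    tn = round(specificity * n_survived)
    return TwoByTwo(tp=tp, fn=n_dead - tp, fp=n_survived - tn, tn=tn)


# Hypothetical study: 40 deaths, 160 survivors, sensitivity 0.80, specificity 0.75.
table = reconstruct_two_by_two(40, 160, 0.80, 0.75)
predicted_dead = table.tp + table.fp
predicted_survived = table.fn + table.tn
print(table, predicted_dead, predicted_survived)
```

The predicted numbers of dead and surviving patients are the column sums of the table, which is why they can be cross-checked against the values reported in each article.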

Risk of bias
Individual study bias was independently evaluated by two researchers using PROBAST 2019 (26), which includes two parts: risk of bias and applicability. The risk of bias section evaluates mainly participants, predictors, outcomes, and analyses. The applicability section evaluates the participants, predictors, and outcomes.

Evidence quality evaluation
In this study, we utilized the GRADE (Grading of Recommendations, Assessment, Development, and Evaluation) system to evaluate the credibility of the findings and assign a level of recommendation (27). It categorizes the evidence quality into four levels: high, moderate, low, and very low, considering factors such as quality of evidence, consistency of results, directness, precision, and potential bias.

Statistical methods
(1) Stata 16 software was used for statistical analysis. (2) The studies were evaluated using sensitivity, specificity, 95% confidence intervals (CIs), and receiver operating characteristic (ROC) curves; comparisons were presented descriptively using forest plots. (3) Heterogeneity among studies was assessed using the I² statistic, with I² > 50% indicating heterogeneity of the results. (4) If no heterogeneity was found among the studies, the fixed-effect model was used; if heterogeneity was evident, the random-effects model was used. (5) Subgroup analysis was used to investigate the sources of heterogeneity. (6) Sensitivity analysis was performed to identify the studies that most affected the pooled diagnostic odds ratio (DOR). (7) Publication bias was detected using Egger's test and presented using Deeks' funnel plot.
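The heterogeneity rule in steps (3) and (4) can be sketched as follows. This is a simplified univariate illustration in Python, not the Stata routine used in this study: it computes the log DOR of each 2×2 table, derives I² from Cochran's Q under inverse-variance weighting, and applies the 50% threshold. The function names and the example tables are hypothetical.

```python
import math


def log_dor(tp, fp, fn, tn):
    """Log diagnostic odds ratio and its approximate variance.

    A 0.5 continuity correction is applied only if a cell is zero.
    """
    if 0 in (tp, fp, fn, tn):
        tp, fp, fn, tn = (c + 0.5 for c in (tp, fp, fn, tn))
    return math.log((tp * tn) / (fp * fn)), 1 / tp + 1 / fp + 1 / fn + 1 / tn


def i_squared(effects, variances):
    """I² (in percent) from Cochran's Q with inverse-variance weights."""
    weights = [1 / v for v in variances]
    pooled = sum(w * e for w, e in zip(weights, effects)) / sum(weights)
    q = sum(w * (e - pooled) ** 2 for w, e in zip(weights, effects))
    df = len(effects) - 1
    return max(0.0, (q - df) / q) * 100 if q > 0 else 0.0


def choose_model(i2, threshold=50.0):
    """Random-effects pooling when I² exceeds the threshold, else fixed-effect."""
    return "random-effects" if i2 > threshold else "fixed-effect"


# Hypothetical 2x2 tables as (tp, fp, fn, tn); not data from the included studies.
tables = [(32, 40, 8, 120), (15, 10, 5, 70), (60, 90, 20, 330)]
effects, variances = zip(*(log_dor(*t) for t in tables))
i2 = i_squared(effects, variances)
print(f"I² = {i2:.1f}% -> {choose_model(i2)} model")
```

Diagnostic meta-analyses are often fitted with bivariate models that pool sensitivity and specificity jointly; the univariate form above is used here only because it makes the I² arithmetic easy to follow.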

Literature screening
Figure 1 shows the PRISMA flow diagram of the study selection process. An exhaustive database search yielded 473 articles. Of these, 87 were eliminated through EndNote; the remaining 386 were screened for titles and abstracts. Following title and abstract screening, 285 publications were excluded, leaving 101 articles for full-text screening. The full texts of these 101 records were retrieved and reviewed for eligibility. For the reasons summarized in Figure 1, 82 articles were excluded. Ultimately, 19 studies were included in our meta-analysis (22, 28-44).
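The counts reported for each stage of the flow diagram can be checked with simple arithmetic. The short Python sketch below uses only the numbers stated in the paragraph above; the variable names are illustrative.

```python
# Counts as reported in the study selection paragraph.
identified = 473
duplicates_removed = 87
excluded_title_abstract = 285
excluded_full_text = 82

screened = identified - duplicates_removed            # records screened by title/abstract
full_text = screened - excluded_title_abstract        # records assessed in full text
included = full_text - excluded_full_text             # studies included in the meta-analysis

print(screened, full_text, included)
```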

Study characteristics
Table 2 lists the datasets extracted from selected articles. Briefly, they were published between 2021 and 2022 in Europe (n = 9), Asia (n = 4), North America (n = 3), South America (n = 1), and Africa (n = 1). Fifteen of the 18 studies were retrospective, in which past samples/images were evaluated; the remaining three articles were prospective. Clinical data were the most commonly used predictors (n = 12); a combination of clinical and imaging data was used to predict severe COVID-19 mortality (n = 8). The included studies had two main categories of AI models: machine learning (n = 9) and deep learning (n = 9).

Risk of bias
Bias assessment included the risk of bias and applicability sections. The results of the qualitative assessment of the included studies are shown in Table 3.
For bias risk assessment, six articles (22, 31, 37, 38, 40, 41) were determined to have a low risk, demonstrating robust methodologies and transparent reporting. Conversely, five articles (33-36, 39) were assigned a high risk of bias due to several factors. These included a lack of reporting on the appropriate handling of missing data and the determination of outcomes without prior knowledge of the predictive factor information. These issues pose a potential threat to the validity of our results. The risk levels for seven articles (28-30, 32, 42-44) remained ambiguous owing to insufficient information. The articles did not explicitly state whether the assessment of predictive factors was conducted without knowledge of the outcome data and whether issues such as model overfitting and optimistic model performance could be explained. This underscores the need for more comprehensive reporting. With respect to the applicability assessment, the majority of articles (17 out of 18) were deemed to be of low risk, suggesting that their findings are likely relevant and applicable to the research context. However, one article (28) received an unclear rating due to insufficient evidence to confirm that the included participants and study setting aligned with the research question.

Fagan plot
Fagan's nomogram was used to evaluate the diagnostic performance of the AI model in predicting mortality in severe COVID-19 cases. In Figure 6A, the pre-test probability, or the anticipated probability of mortality prior to the test results, was set at 20%. The likelihood ratios for positive and negative results for all the included articles were 5 and 0.30, respectively. These values yielded post-test probabilities of 56% for positive results and 7% for negative results.
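The post-test probabilities read from a Fagan nomogram follow from Bayes' rule in odds form: the pre-test odds are multiplied by the likelihood ratio and converted back to a probability. The Python sketch below reproduces the overall values from the figures quoted above; the function name is illustrative.

```python
def post_test_probability(pre_test_prob, likelihood_ratio):
    """Convert a pre-test probability to a post-test probability via odds."""
    pre_odds = pre_test_prob / (1 - pre_test_prob)
    post_odds = pre_odds * likelihood_ratio
    return post_odds / (1 + post_odds)


# Overall model: pre-test probability 20%, LR+ = 5, LR- = 0.30.
positive = post_test_probability(0.20, 5)
negative = post_test_probability(0.20, 0.30)
print(f"{positive:.0%} after a positive result, {negative:.0%} after a negative result")
# prints "56% after a positive result, 7% after a negative result"
```

Because likelihood ratios are reported in rounded form, values recomputed this way for the subgroup models may differ from the published percentages by a few points.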

Clinical data
Figure 6B illustrates that, given an initial expected mortality probability of 20% prior to testing, a prediction model solely utilizing clinical data as predictors yielded likelihood ratios of 8.0 for positive results and 0.28 for negative results. These values yielded post-test probabilities of 65% for positive results and 6% for negative results.

Imaging data
As shown in Figure 6C, under the presumption of a 20% mortality probability prior to testing, a prediction model solely utilizing imaging data as predictors yielded likelihood ratios of 3 for positive results and 0.58 for negative results. These values yielded post-test probabilities of 39% for positive results and 12% for negative results.

Combining clinical and imaging data
As shown in Figure 6D, with an assumed pre-test mortality probability of 20%, a prediction model that used a combination of clinical and radiographic data as predictors yielded likelihood ratios of 3 for positive results and 0.58 for negative results. These values yielded post-test probabilities of 39% for positive results and 12% for negative results.

Subgroup analyses
Meta-regression and subgroup analyses were conducted to explore the effects of different model types, economic income levels, study design methods, and settings in which all patients came from the ICU on the prediction power.
The results show that, within the context of the imaging data prediction model, high-income economies exhibited heightened prediction sensitivity (p < 0.05). Conversely, as depicted in Figure 7A, the model type, study design, and ICU status did not show significant disparities. Within the combined clinical and radiological prediction model, we observed several significant differences. Notably, when all patients were exclusively from the Intensive Care Unit, as well as in the case of artificial intelligence prediction models utilizing deep learning and models originating from high-income economies, a higher predictive specificity was demonstrated (p < 0.05), as illustrated in Figure 7B.

Sensitivity analysis
A sensitivity analysis was conducted to explore the robustness and consistency of the results. As shown in Table 4, the overall pooled DOR did not change significantly after each study was removed individually from the model, indicating good stability. However, in the model based on imaging data, after excluding the studies by Walston et al. (28) and Chrzan et al. (35), the overall I² changed from 55.6% to 29.30% and 2.70%, respectively, indicating that these two studies may be a source of heterogeneity (Tables 5, 6). In addition, among the eight studies based on the combination of clinical and imaging models, the overall heterogeneity was reduced from 96.2% to 58.6% after the exclusion of the study by Elghamrawy et al. (33), indicating that it may be the main source of heterogeneity (Table 7).
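The leave-one-out procedure described above can be sketched in Python as follows: each study is removed in turn and I² is recomputed on the remainder. The effect sizes and variances are hypothetical values chosen so that one study is an obvious outlier; they are not the data of the included studies, and the helper names are illustrative.

```python
def i_squared(effects, variances):
    """I² (in percent) from Cochran's Q with inverse-variance weights."""
    weights = [1 / v for v in variances]
    pooled = sum(w * e for w, e in zip(weights, effects)) / sum(weights)
    q = sum(w * (e - pooled) ** 2 for w, e in zip(weights, effects))
    df = len(effects) - 1
    return max(0.0, (q - df) / q) * 100 if q > 0 else 0.0


def leave_one_out_i2(labels, effects, variances):
    """Recompute I² after omitting each study in turn."""
    results = {}
    for k, label in enumerate(labels):
        rest_effects = effects[:k] + effects[k + 1:]
        rest_variances = variances[:k] + variances[k + 1:]
        results[label] = i_squared(rest_effects, rest_variances)
    return results


# Hypothetical log DOR values; study "D" is deliberately an outlier.
labels = ["A", "B", "C", "D"]
effects = [1.0, 1.1, 0.9, 3.0]
variances = [0.05, 0.05, 0.05, 0.05]

overall = i_squared(effects, variances)
omitted = leave_one_out_i2(labels, effects, variances)
print(f"all studies: I² = {overall:.1f}%")
for label, value in omitted.items():
    print(f"omitting {label}: I² = {value:.1f}%")
```

A study whose omission causes a large drop in I², as "D" does in this toy example, is a candidate source of heterogeneity, which is the reasoning applied to the studies named above.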

Publication bias
The Deeks' Funnel Plot Asymmetry Test, as depicted in Figure 8, was utilized to investigate the presence of publication bias. The studies, represented by individual data points, are not symmetrically distributed around the regression line (p < 0.001), suggesting the presence of publication bias or other small-study effects in the meta-analysis.
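Deeks' test is usually described as a regression of the log DOR on the inverse square root of the effective sample size (ESS), weighted by ESS, with a slope that differs from zero indicating funnel plot asymmetry. The Python sketch below implements that regression with plain weighted least squares. It is an illustration under stated assumptions, not the Stata procedure used here: the 2×2 tables are hypothetical, the function names are invented, and the p-value is omitted (the t statistic would be compared with a t distribution on n − 2 degrees of freedom).

```python
import math


def weighted_linear_fit(xs, ys, ws):
    """Weighted least-squares fit of y = intercept + slope * x.

    Returns (intercept, slope, standard error of the slope).
    """
    n = len(xs)
    sw = sum(ws)
    x_bar = sum(w * x for w, x in zip(ws, xs)) / sw
    y_bar = sum(w * y for w, y in zip(ws, ys)) / sw
    sxx = sum(w * (x - x_bar) ** 2 for w, x in zip(ws, xs))
    sxy = sum(w * (x - x_bar) * (y - y_bar) for w, x, y in zip(ws, xs, ys))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    rss = sum(w * (y - intercept - slope * x) ** 2 for w, x, y in zip(ws, xs, ys))
    slope_se = math.sqrt(rss / (n - 2) / sxx)
    return intercept, slope, slope_se


def deeks_asymmetry(tables):
    """Slope and t statistic of ln(DOR) regressed on 1/sqrt(ESS), weighted by ESS."""
    xs, ys, ws = [], [], []
    for tp, fp, fn, tn in tables:
        n_dead, n_survived = tp + fn, fp + tn
        ess = 4 * n_dead * n_survived / (n_dead + n_survived)
        if 0 in (tp, fp, fn, tn):  # continuity correction only for zero cells
            tp, fp, fn, tn = (c + 0.5 for c in (tp, fp, fn, tn))
        xs.append(1 / math.sqrt(ess))
        ys.append(math.log((tp * tn) / (fp * fn)))
        ws.append(ess)
    _, slope, slope_se = weighted_linear_fit(xs, ys, ws)
    t_stat = slope / slope_se if slope_se > 0 else math.inf
    return slope, t_stat


# Hypothetical 2x2 tables as (tp, fp, fn, tn); not data from the included studies.
hypothetical_tables = [(32, 40, 8, 120), (15, 10, 5, 70), (60, 90, 20, 330), (9, 4, 6, 31)]
slope, t_stat = deeks_asymmetry(hypothetical_tables)
print(f"slope = {slope:.2f}, t = {t_stat:.2f} on {len(hypothetical_tables) - 2} df")
```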

Evidence quality evaluation
The aforementioned four important outcomes in this meta-analysis were evaluated using the GRADE system. The evidence quality level for each outcome was found to be very low (Table 8). Since all the included studies in this research had observational designs, the initial quality rating of the studies was assessed as "low." The final overall level of evidence was rated as "very low," which may decrease the credibility of any recommendations.

Discussion
Since the emergence of COVID-19 in 2019, the global spread of the novel coronavirus has caused significant disruptions in mortality patterns worldwide (45, 46). The ongoing pandemic has prompted substantial efforts to prevent and treat COVID-19, leading to remarkable achievements (47-50). The World Health Organization ceased considering COVID-19 a global health emergency in May 2023. Despite these advancements, SARS-CoV-2 continues to persist, evolve, and threaten human life (51). Long COVID remains a major challenge (52, 53). Severe COVID-19 presents an ongoing and substantial risk to individual health and well-being, with mortality rates as high as 49% among critically ill individuals (54). Research exploring AI as an assistive tool for COVID-19 in this environment has gained prominence over the past 3 years (55-57). This study investigated this important topic.
AI is increasingly used in medicine (14). AI can autonomously search and extract intricate task-specific features, offering the advantage of cost-effectiveness (56, 58). This systematic review and meta-analysis meticulously synthesized the available evidence regarding the performance of AI prediction models for severe COVID-19 mortality. In the included literature, the clinical data encompassed demographic information (such as age and sex), comorbidities (hypertension, stroke, atrial fibrillation, etc.), and laboratory data (albumin, hemoglobin, sodium, potassium, etc.), whereas the imaging data typically consisted of CT scans, X-rays, and their associated parameters. Integrating clinical and imaging data is pivotal in precision medicine and large-scale research programs, enabling a comprehensive understanding of disease patterns and facilitating targeted and effective interventions (59, 60). The meta-analysis encompassed 25 prediction models, showing commendable sensitivity and specificity, with an area under the curve (AUC) of 0.74 (0.64-0.83) and 0.86 (0.76-0.92), respectively. Furthermore, the Fagan plot demonstrated commendable positive post-test probability (56%) and negative post-test probability (7%). This indicates that when the AI model predicts death, the post-test probability of death is 56%, and when it predicts survival, the post-test probability of death is 7%. Overall, the results suggest that the AI prediction models hold promise for predicting severe COVID-19 mortality rates. However, the 25 studies included in our analysis exhibited significant heterogeneity. To explore the sources of heterogeneity, we conducted sensitivity and subgroup analyses. The overall results showed that despite Khan et al. (41) having a significantly larger sample size than the others, the exclusion of this study did not result in a significant change in the diagnostic odds ratio value and its heterogeneity. This suggests that an imbalance in the dataset was not the primary source of heterogeneity. We also conducted a sensitivity analysis of different subgroups of predictive factors. We found that, in models based on radiological data, the exclusion of the study by Chrzan et al. (35) resulted in a substantial decrease in I² to 2.7% and an increase in the DOR value. This could be attributed to the unique nature of the predictive data used in this study, which were derived from High-Resolution Computed Tomography (HRCT) and its specific parameters such as absolute inflammation volume and absolute consolidation volume (ACV), as opposed to conventional CT scans or chest X-rays. We conjecture that HRCT, as a relatively novel diagnostic instrument, may not be universally applicable to all patient populations. This potential limitation could contribute to the observed decrease in the DOR value.
Additionally, we conducted subgroup analyses based on different model types, income levels, types of study designs, and ICU states. Our results indicate that studies employing deep learning, originating from high-income economies, or involving patients exclusively in the ICU, often exhibit higher specificity. A plausible explanation for this could be the urgent need for ICU patients and those in high-income economies to avoid death, which might cause the models to lean more toward predicting a higher risk of mortality. This could potentially enhance the model's ability to correctly identify true-negative cases, thereby increasing specificity. However, it is important to note that although this might improve the model's performance in an ICU setting, it may not necessarily work in other contexts. Furthermore, our findings suggest that studies employing deep learning models demonstrate higher specificity. This can be attributed to the inherent capabilities of deep learning models, which are adept at capturing complex nonlinear relationships in high-dimensional data. This allows them to discern subtle patterns that may be overlooked by traditional statistical models, thereby enhancing their ability to correctly identify true-negative cases, and consequently, increasing their specificity. However, it is important to note that although deep learning models can offer improved performance, their effectiveness is heavily dependent on the quality and diversity of the training data. Therefore, future studies should focus on ensuring the collection of comprehensive, high-quality datasets that capture a wide range of patient characteristics and clinical scenarios. Moreover, given the black-box nature of deep learning models, efforts should be made to improve their interpretability (61).
This study had limitations. First, the sample size of the included studies was relatively small, and only English literature was considered, potentially introducing language and publication biases. The exclusion of literature in other languages, including German, Japanese, and Korean, may have influenced the findings. Second, the funnel plot analysis revealed an asymmetric funnel distribution, suggesting the presence of publication bias. This bias could affect the overall interpretation of the results and the generalizability of the findings. Furthermore, the included studies exhibited high heterogeneity, which could affect the accurate assessment of the overall model performance and limit reliable inferences for specific subgroups. Variations in study design, patient characteristics, and data sources may have contributed to heterogeneity. Although AI demonstrates promising outcomes in predicting severe COVID-19 mortality, it has ample scope for improvement. Continued research and development are necessary to enhance the sensitivity and specificity of prediction models. Moreover, applications of artificial intelligence extend beyond predicting COVID-19 prognoses. They are increasingly employed in diagnosing, screening, and managing other diseases, aligning with the objectives of tertiary prevention strategies. To promote the integration of AI and hospital digitalization, conducting high-quality, large-scale, multicenter studies is imperative. Such studies would advance the field of AI in healthcare and foster its effective implementation.

Conclusion
In summary, while artificial intelligence has shown promise in predicting mortality in severe COVID-19, the suitability of different models varies for specific populations, and there is still room for improvement in their predictive performance. In addition, the full applicability of these models in clinical practice remains uncertain. Therefore, ongoing research and development efforts are necessary to enhance the performance of these models. The application of AI extends beyond COVID-19; it is being utilized in diagnosing, screening, and managing other diseases, aligning with the objectives of tertiary prevention strategies. Therefore, conducting high-quality, large-scale, multicenter studies is imperative for advancing the field of AI in healthcare and ensuring its effective implementation.

Inclusion criteria
(1) Research class: all articles in this study were published in English. (2) Study subjects: the participants were all patients aged >18 years diagnosed with COVID-19 and were either fully or partially treated in the ICU. (3) Patient grouping: deceased or surviving. (4) Outcome indicators: the actual number of deaths, the actual number of survivors, and at least two of AUC, sensitivity, specificity, accuracy, and F1-score were provided. (5) Study type: cohort or case-control studies.

FIGURE 1 PRISMA flow diagram of study selection.

FIGURE 3 Sensitivity, specificity, and ROC curve based on clinical data. (A) Sensitivity and specificity based on clinical data model, (B) ROC curve based on clinical data.

FIGURE 4 Sensitivity, specificity, and ROC curve based on imaging data. (A) Sensitivity and specificity based on imaging data model, (B) ROC curve based on imaging data.

FIGURE 5 Sensitivity, specificity, and ROC curve based on combined clinical and imaging data. (A) Sensitivity and specificity based on combined data, (B) ROC curve based on combined data.

FIGURE 6 Fagan plot. (A) Overall Fagan plot, (B) Fagan plot based on clinical data model, (C) Fagan plot based on imaging data model, (D) Fagan plot based on combined clinical and imaging data model.

FIGURE 7 Meta-regression and subgroup analysis. (A) Based on imaging data prediction models, (B) Based on combination of imaging and clinical data prediction models.

TABLE 1
Literature search strategy on PubMed.

TABLE 2
Characteristics and extracted data from included studies.

TABLE 3
Assessment of bias by PROBAST.

TABLE 4
Sensitivity analysis of included 25 studies (total).

TABLE 5
Sensitivity analysis of included 12 studies (based on clinical data).

TABLE 6
Sensitivity analysis of included five studies (based on imaging data).

TABLE 7
Sensitivity analysis of included eight studies (based on the combined data).

Deeks' funnel plot was used to evaluate publication bias.